Accelerating Complex Data Transfer for Cluster Computing
Abstract
The ability to move data quickly between the nodes of a distributed system is important for the performance of cluster computing frameworks such as Hadoop and Spark. We show that, in a cluster with modern networking technology, data serialization is the main bottleneck and source of overhead when transferring rich data in systems based on high-level programming languages such as Java. We propose a new data transfer mechanism that avoids serialization altogether by storing data in a shared cluster-wide address space. We describe the design and a prototype implementation of this approach, show that the mechanism is significantly faster than serialized data transfer, and propose a number of possible applications for it.
Similar resources
Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments
The Hadoop MapReduce framework is an important distributed processing model for large-scale, data-intensive applications. The current Hadoop and the existing Hadoop distributed file system's rack-aware data placement strategy in MapReduce in the homogeneous Hadoop cluster assume that each node in a cluster has the same computing capacity and that the same workload is assigned to each node. Default Hadoop d...
A Novel Scheme for Accelerating Support Vector Clustering
Limited by two time-consuming steps, solving the optimization problem and labeling the data points with cluster labels, algorithms based on support vector clustering (SVC) perform poorly on large datasets. This paper presents a novel scheme aimed at solving these two problems and accelerating SVC. Firstly, an innovative definition of noise data points is proposed which c...
GPU Cluster for Acceleration of Scientific and Engineering Applications in the Context of Higher Education
Many fields of research now rely on High Performance Computing (HPC) systems which can process ever larger datasets, with increasing accuracy and speed. Many universities now provide an HPC service. Following the trend over the past few years of the world's fastest supercomputers being accelerated using Graphics Processing Units (GPUs), there is a growing interest in the use of GPUs in Higher Ed...
Flexible Intermediate Library for MPI-2 Support on an SCore Cluster System
A flexible intermediate library named Stampi, providing MPI-2 support in heterogeneous computing environments, has been implemented on an SCore cluster system. With the help of the library's flexible communication mechanism, users can execute MPI functions without awareness of the underlying communication mechanism. In message transfer of Stampi, a vendor-supplied MPI library and TCP sockets are used ...
Entropy-based Consensus for Distributed Data Clustering
The ever larger scale of available data and the increasingly restrictive concerns about their privacy are some of the challenging aspects of data mining today. In this paper, Entropy-based Consensus on Cluster Centers (EC3) is introduced for clustering in distributed systems with a consideration for confidentiality of data; i.e. it is the negotiations among local cluster centers that are used in t...